Holdout-Based Empirical Assessment of Mixed-Type Synthetic Data

Abstract

AI-based data synthesis has seen rapid progress over the last several years and is increasingly recognized for its promise to enable privacy-respecting, high-fidelity data sharing. This is reflected by the growing availability of both commercial and open-source software solutions for synthesizing private data. However, despite these recent advances, adequately evaluating the quality of generated synthetic datasets is still an open challenge. We aim to close this gap and introduce a novel holdout-based empirical assessment framework for quantifying the fidelity as well as the privacy risk of synthetic data solutions for mixed-type tabular data. Fidelity is measured by statistical distances between lower-dimensional marginal distributions, which provide a model-free and easy-to-communicate empirical metric for the representativeness of a synthetic dataset. Privacy risk is assessed by calculating the individual-level distances to the closest record with respect to the training data. By showing that the synthetic samples are just as close to the training data as to the holdout data, we obtain strong evidence that the synthesizer has indeed learned to generalize patterns and is independent of individual training records. We empirically demonstrate the presented framework for seven distinct synthetic data solutions across four mixed-type datasets and then compare these to traditional data perturbation techniques. Both a Python-based implementation of the proposed metrics and the demonstration study setup are made available as open source. The results highlight the need to systematically assess the fidelity as well as the privacy of this emerging class of synthetic data generators.
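The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' published implementation, and it rests on several assumptions: all columns are already discretized into integer category codes, fidelity is shown only for a single one-way marginal using total variation distance, and closeness between records is measured by Hamming distance (the number of differing fields). The function names `marginal_tvd`, `distance_to_closest_record`, and `share_closer_to_training` are hypothetical.

```python
import numpy as np


def marginal_tvd(real_col, synth_col):
    """Total variation distance between the empirical distributions
    of one categorical column in the real and the synthetic data."""
    categories = np.union1d(real_col, synth_col)
    p = np.array([np.mean(real_col == c) for c in categories])
    q = np.array([np.mean(synth_col == c) for c in categories])
    return 0.5 * np.abs(p - q).sum()


def distance_to_closest_record(queries, reference):
    """For each query row, the Hamming distance to its nearest
    reference row (assumed distance measure, not from the source)."""
    # Shape (n_queries, n_reference): mismatching fields per record pair.
    mismatches = (queries[:, None, :] != reference[None, :, :]).sum(axis=2)
    return mismatches.min(axis=1)


def share_closer_to_training(synth, train, holdout):
    """Share of synthetic rows that are closer to the training data than
    to the holdout data; ties count as one half (an assumed convention)."""
    d_train = distance_to_closest_record(synth, train)
    d_holdout = distance_to_closest_record(synth, holdout)
    closer = np.sum(d_train < d_holdout)
    ties = np.sum(d_train == d_holdout)
    return (closer + 0.5 * ties) / len(synth)
```

Under these assumptions, a share near 0.5 would suggest that the synthetic records are no closer to the training data than to unseen holdout data, while a share near 1.0 would suggest that training records are being reproduced. The pairwise comparison uses memory proportional to the product of the two row counts, so larger datasets would need batching or a nearest-neighbor index.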

Similar resources

Modeling Loss Data by Phase-Type Distribution

Insurers are always concerned about losses on the policies they cover and look for methods to model past loss data in order to make optimal decisions. This study introduces phase-type distributions for modeling loss data, including the corresponding statistical inference and the use of the EM algorithm to estimate the distribution's parameters. Finally, the possibility of using this distribution for modeling grouped data ...

Empirical Exploitation of Click Data for Query-Type-Based Ranking

In machine-learning-based ranking, each category of queries can be assigned its own ranking function, an approach called query-type-based ranking. Such a divide-and-conquer strategy can potentially provide a better ranking function for each query category. A critical problem for query-type-based ranking is training data insufficiency, which may be solved by using the data extracted from ...

Outlier Detection on Mixed-Type Data: An Energy-Based Approach

Outlier detection amounts to finding data points that differ significantly from the norm. Classic outlier detection methods are largely designed for a single data type, such as continuous or discrete. However, real-world data is increasingly heterogeneous, where a data point can have both discrete and continuous attributes. Handling mixed-type data in a disciplined way remains a great challenge. I...

New Shewhart-type synthetic X̄ control schemes for non-normal data

In this paper, Burr-type XII X̄ synthetic schemes are proposed as an alternative to the classical X̄ synthetic schemes when the assumption of normality fails to hold. First, the basic design of the Burr-type XII X̄ synthetic scheme is developed and its performance investigated using exact formulae. Secondly, the non-side-sensitive and side-sensitive Burr-type XII X̄ synthetic schemes are int...

The Effect of Corruption on Shadow Economy: An Empirical Analysis Based on Panel Data

Quite often shadow economy (SE) and corruption are seen as "twins", which need each other or fight against each other and theoretically can be either complements or substitutes. Therefore, the relationship between SE and corruption has been a controversial and polemical issue and in the spotlight of a remarkable collection of economists and social researchers. The main objective of this study ...

Journal

Journal title: Frontiers in Big Data

Year: 2021

ISSN: 2624-909X

DOI: https://doi.org/10.3389/fdata.2021.679939